Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CharRange and CharIter #111

Closed
wants to merge 12 commits into from
Closed

CharRange and CharIter #111

wants to merge 12 commits into from

Conversation

CAD97
Copy link
Collaborator

@CAD97 CAD97 commented Aug 12, 2017

Expanded documentation and tests are TODO -- working on those presently.

We can land the patch to use the char range in a different PR.

Expanded documentation and tests are TODO but this should work
@CAD97 CAD97 self-assigned this Aug 12, 2017
@CAD97 CAD97 requested a review from behnam August 12, 2017 01:30
license = "MIT/Apache-2.0"
keywords = ["text", "unicode", "iteration"]
description = "UNIC - CharRange"
categories = ["text-processing", "iteration"]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't actually checked that this is valid yet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no "iteration" category. Maybe we don't need more than one. Here's the list: https://crates.io/categories/

FYI, I've got "internationalization" (and "localization") added to the list, but it hasn't made it to the live version yet. When that happens, almost all UNIC components get both text-processing and internationalization.

CAD97 added 4 commits August 11, 2017 22:05
For some reason `allow(private_in_public)` is in the test code in 1.17.
This is an error when `forbid(future_incompatible)`. So I made it deny.
@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

Benchmark results from char-iter:

test benches::count          ... bench:   1,000,194 ns/iter (+/- 34,045)
test benches::count_baseline ... bench:     937,333 ns/iter (+/- 66,945)

Benchmark results from unic-char-range:

test count          ... bench:   2,855,694 ns/iter (+/- 174,285)
test count_baseline ... bench:     941,054 ns/iter (+/- 59,795)

... I've got a bit of work to do don't I ...

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

I can not for the life of me figure out why char-iter is better than twice as fast as I am.

I'm going to try re-reimplementing this but closer to char-iter and see if I can't get the better performance.

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

I figured out where I was losing time. In my experiments I think I found a cleaner for our uses design though... and it involves fewer moving parts (I think)

Copy link
Member

@behnam behnam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on getting the optimization right! 💯

Let me sleep on the type name and macro syntax. I don't want us to diverge from rustc concepts too much, so it's easier for new users to adapt, as well as putting things back into rustc. Specially, about the macro, I think we better support the common syntax, at least.

The code looks very good. A bunch of nits and a question inline.

repository = "https://github.com/behnam/rust-unic/"
license = "MIT/Apache-2.0"
keywords = ["text", "unicode", "iteration"]
description = "UNIC - CharRange"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better to use more informative language in description, as it's the only text that shows up on crates.io lists and item page, and allows better matching in search.

To follow the pattern from other UNIC components, it can be something like this:

description = "UNIC - Unicode Characters - Character Range and Iteration"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return None;
}

let char = unsafe { char::from_u32_unchecked(self.low) };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know it's possible to call a variable char! That's cool! But also uncommon practice, but I don't know why! Anyways, may be safer to just call it ch or chr.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 it's a bad habit. It's not something you'd really expect to work, but it does (I think because Rust does type and variable name lookup differently). Surprisingly though, my IDE highlights it right, which is part of why I do it even though I probably shouldn't.

}
let naive_len = self.high as usize - self.low as usize;
if self.low <= SURROGATE_RANGE.start && SURROGATE_RANGE.end <= self.high {
naive_len - SURROGATE_RANGE.len()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you can optimize here by putting this number in a const variable, but could be unnecessary. Not sure.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately doing so and still calling len() would require const fn to be stable. I can see if precomputing it manually saves any non-negligible amount of time though!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out it takes 2 ns (+/- 0 ns) either way.


// ensure `high` is never one greater than a surrogate code point
if self.high == SURROGATE_RANGE.end {
self.high = SURROGATE_RANGE.start;
Copy link
Member

@behnam behnam Aug 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation with the new invariants is even more clean and slim! 👍

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll like #112 then 😄

//! ```
//!
#![forbid(bad_style, missing_debug_implementations, unconditional_recursion)]
#![deny(missing_docs, unsafe_code, unused, future_incompatible)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why unused, future_incompatible are in deny() but not forbid()? Doesn't look like we allow() them anywhere in this crate.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

d24057a for future_incompatible. (It doesn't work on 1.17 for some reason.) unused, just because I was allowing it temporarily while building the library up.

Basically, I picked deny for things that could plausibly be temporarily allowed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I guess it's fine for non-major flags, and we should probably just rely on clippy (in a future version) to hint on all forbidable denys.


/// Does this range include a character?
#[inline]
pub fn contains(&self, char: char) -> bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same about the char variable name.

@@ -0,0 +1,40 @@
use std::char;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have a module here called char, I think it would be easier to read the code if we stick with std::char instead of using it as char. What do you think?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the problem is std::char and the primitive type char conflicting, that's an intended overlap. The char module contains things like char::from_u32 and char::MAX which should act like they are on the primitive char.

If you're saying unic-char, that's not referenced anywhere from my code. The only time there is a char module there is from inside the main unic crate.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought we have a char.rs here next to step.rs. You're right, there's no conflict and it's fine. #aftermidnightreview

///
/// If the given `char` is already `char::MAX`, it is returned unchanged.
#[allow(unsafe_code)]
pub fn forward(c: char) -> char {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the same variable name here would make is easier to read.

/// If the given `char` is already `char::MAX`, it is returned unchanged.
#[allow(unsafe_code)]
pub fn forward(c: char) -> char {
if c == char::MAX {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, these checks result in an open_range() to silently create a CharRange with first > last. For example, if we go with open_range('\u{0x20}', '\u{0x21}'), then first == U+21andlast = U+20`. Does this violate any invariants, or we don't care?

(It doesn't look like causing any issues right now, but may be with expansions of the API.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the same with .. syntax: you can create a "backwards" range and it just silently gives you a finished Range. We can put panic! guards around the constructors for improper use, but the standard library here has set a precedent of making an empty range.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Yeah, as long as the behavior is intended and works similar to std lib, it's fine.

I think it worth adding a hint about it in the constructors, though.

#[test]
fn iter_fused() {
let mut iter = CharRange::all().iter();
let mut fused = all_chars().into_iter().fuse();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to be gated behind the fused feature. Surprised that it's passing the CI!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fn to fuse an iterator is stable. It's merely the trait to declare yourself as fused that is unstable.

@behnam
Copy link
Member

behnam commented Aug 12, 2017

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

Whoops, replaced it while you were reviewing... Much was shared though so it's not lost effort.

@CAD97
Copy link
Collaborator Author

CAD97 commented Aug 12, 2017

As for the macro, the syntax was a curveball idea. I'm sticking with just the standard .. and ..= offerings.

Copy link
Member

@behnam behnam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @CAD97! A couple of responses here, but I'll put everything possibly actionable on #112, as well.

//! ```
//!
#![forbid(bad_style, missing_debug_implementations, unconditional_recursion)]
#![deny(missing_docs, unsafe_code, unused, future_incompatible)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I guess it's fine for non-major flags, and we should probably just rely on clippy (in a future version) to hint on all forbidable denys.

@@ -0,0 +1,40 @@
use std::char;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought we have a char.rs here next to step.rs. You're right, there's no conflict and it's fine. #aftermidnightreview

/// If the given `char` is already `char::MAX`, it is returned unchanged.
#[allow(unsafe_code)]
pub fn forward(c: char) -> char {
if c == char::MAX {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Yeah, as long as the behavior is intended and works similar to std lib, it's fine.

I think it worth adding a hint about it in the constructors, though.

@CAD97 CAD97 deleted the unic-char-range branch August 13, 2017 08:14
bors bot added a commit that referenced this pull request Aug 13, 2017
112: [char/range] Add CharRange and CharIter r=behnam

The first half of adressing #91.
Closes #111, this manner of attack is better than it.

This PR only has one type, `CharRange`. It is effectively `std::ops::RangeInclusive` translated to characters. The matter of construction is handled by both half-open and closed constructors offered and a macro to allow for `'a'..='z'` syntax.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants